Metric SDK - remove de-duplicating attributes for perf gain #1398
Conversation
Codecov Report

```
@@           Coverage Diff           @@
##            main   #1398     +/-  ##
=======================================
+ Coverage   57.3%   57.6%    +0.2%
=======================================
  Files        146     146
  Lines      18179   18239      +60
=======================================
+ Hits       10422   10508      +86
+ Misses      7757    7731      -26
```

☔ View full report in Codecov by Sentry.
So, going back to our conversation from before: the gain here is not the 7x we saw. This is closer to a 6-8% improvement based on the stress test. I guess my interpretation was that we'd look at bounded instruments and/or allowing users to pass in prebuilt AttributeSets.
See previous comment
There is no 7x gain with this PR. I am trying to understand why you expected 7x here? Apologies if I missed something obvious.
Yes absolutely! That is a separate improvement, helping users who have key/values known ahead of time. This PR is not attempting that improvement.
If users have duplicate keys in their attributes, that is already an indication of something wrong with their instrumentation. I don't think it's fair to ask every user to pay the perf penalty for dedup! I am totally open to bringing de-dup back as an opt-in feature, which is why I had opened this issue to gather feedback.
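For context, here is a minimal, std-only sketch of what a last-key-wins de-dup pass in the hot path costs (the `dedup_last_wins` name is hypothetical, not the SDK's code): every measurement pays for hashing each key and for the extra allocations, even when there are no duplicates.

```rust
use std::collections::HashSet;

/// Hypothetical sketch of a hot-path de-dup pass: keep only the last
/// value seen for each key. Hashes every key and clones every entry on
/// every measurement, which is the cost this PR removes.
fn dedup_last_wins(attrs: &[(String, String)]) -> Vec<(String, String)> {
    let mut seen = HashSet::new();
    let mut out = Vec::with_capacity(attrs.len());
    // Walk backwards so the last occurrence of each key wins.
    for (k, v) in attrs.iter().rev() {
        if seen.insert(k.clone()) {
            out.push((k.clone(), v.clone()));
        }
    }
    // Restore the original ordering of the surviving entries.
    out.reverse();
    out
}
```

With duplicates absent (the common case), all of this work is pure overhead, which is the argument for making the fast path the default.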
TL;DR I think we should add the feature flag before merging this.
My point was maybe not well made.
I think we'd be better off focusing on providing those to improve performance. And if we're going to break functionality, I really think we need to be more thoughtful about it. Yes, we want performance, but if we are constantly breaking things, people have little incentive to move forward, and with our latest release I don't think we've done a good job of providing a path forward. This is a wider point, but it's relevant. I don't want to be the barrier to progress, but I also don't want no one to use this because they experience frustration through upgrades.
On the point above, I think we should have a feature flag on this to allow people to keep the old behavior. Feedback is great, but we don't give people the option to try both ways, as they would have to use an older version which might miss other internal improvements.
@hdost I'm not entirely clear on who might be affected by this change. The API itself hasn't changed; it's more a shift in behavior from de-duplicating in the SDK to doing so in the Prometheus exporter. Is it common for anyone to intentionally send duplicate attributes, expecting the SDK to handle the deduplication? Considering there are no extensibility points like SpanProcessor/LogRecordProcessor, it seems unlikely that additional attributes could be added even inadvertently, leading to duplication. Thus, the individual emitting the measurement should have complete control and can take steps to avoid passing duplicate keys. If there are other scenarios I'm not considering, I'm open to understanding more about them.

From what I've seen in OTel .NET, which has never performed deduplication either in the SDK or the Prometheus Exporter, duplication hasn't been a point of concern.

I do understand the importance of being mindful of any changes that could potentially break existing workflows. However, in this instance, I'm not convinced that this change is of that nature. This is similar to the changes we've made previously with Logs and Traces. We didn't implement feature flags for traces or logs, and I'm not sure it's necessary for metrics either. I strongly believe that the default behavior should prioritize efficiency, and I fully support the idea of a feature flag or opt-in approach for the less efficient route involving deduplication, rather than defaulting to it. Would you reconsider your objection to this PR?
I guess we're on the same page that there should be a feature flag. As I suggested, it could be something like:
@hdost Just to clarify (because it was easy to lose in all the noise in the other PR), #1379 had a 6x improvement in the stress test because of the precaching of the hash values. This precaching gave the stress test extremely high gains on some machines (but not all) because it moved the hash computation out of the section of code under a mutex lock, thus allowing the costly hash to be calculated in parallel under load. That optimization has been removed (to be opened in a new PR), but it's also a "best case" optimization that may not be realistic in applications that do more than just log metrics.
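The precaching idea described above can be sketched roughly like this (std-only, with a hypothetical `AggregateMap` type; the real SDK's aggregation is more involved): the hash is computed before the mutex is taken, so under load, concurrent threads hash in parallel and only the cheap map update is serialized.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

/// Hypothetical sketch: aggregate values per attribute set, keyed by a
/// precomputed hash. A real implementation must also resolve hash
/// collisions by comparing the attributes themselves.
struct AggregateMap {
    inner: Mutex<HashMap<u64, u64>>,
}

impl AggregateMap {
    fn new() -> Self {
        AggregateMap { inner: Mutex::new(HashMap::new()) }
    }

    fn add(&self, attrs: &[(&str, &str)], value: u64) {
        // Costly part: hash the attributes OUTSIDE the lock, so
        // multiple threads can do this concurrently.
        let mut hasher = DefaultHasher::new();
        attrs.hash(&mut hasher);
        let key = hasher.finish();

        // The critical section is reduced to a lookup + add.
        let mut map = self.inner.lock().unwrap();
        *map.entry(key).or_insert(0) += value;
    }

    fn total(&self) -> u64 {
        self.inner.lock().unwrap().values().sum()
    }
}
```

This also illustrates why the gain is machine-dependent: with little contention (or few cores), moving work out of the critical section changes almost nothing.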
Not fully! I agree to add a feature flag if we know users are affected, but not right away (i.e., not in this PR). Generally, I prefer not to add feature flags/options without some evidence that they are needed. The same approach was already followed for logs and traces.
Looking at some context for the attribute collections part of the spec from the PR where it was introduced, it does seem like the intention is for de-duplication to happen if possible. Could we maybe have a feature flag to opt out for perf gains, rather than opting in?
That is effectively penalizing almost all(?) users by doing de-dup logic in the hot path! I really hope we can agree to make the fastest path the default! To unblock progress, are folks okay to observe feedback from #1300, and then decide if we want a feature flag, and determine whether that flag is enabled-by-default vs. opt-in at the time we add it?
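If a flag were added, the opt-in shape could look like the following sketch. Both the `metrics-attribute-dedup` Cargo feature name and the `prepare_attributes` helper are hypothetical, not part of the SDK; the point is that the default build compiles the de-dup branch away.

```rust
/// Hypothetical helper: de-dup runs only when an (assumed) Cargo
/// feature such as `metrics-attribute-dedup` is enabled, keeping the
/// default hot path free of sorting and extra allocations.
fn prepare_attributes(mut attrs: Vec<(String, String)>) -> Vec<(String, String)> {
    if cfg!(feature = "metrics-attribute-dedup") {
        // Opt-in slow path: stable sort by key, keep the last value
        // per key (last write wins).
        attrs.sort_by(|a, b| a.0.cmp(&b.0));
        attrs.dedup_by(|next, prev| {
            if next.0 == prev.0 {
                // `dedup_by` drops `next` when this returns true, so
                // copy the later value into the surviving element.
                prev.1 = next.1.clone();
                true
            } else {
                false
            }
        });
    }
    attrs
}
```

With the feature disabled (the default here), duplicates pass through untouched and the backend decides how to treat them.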
Maybe it's prudent to have a concrete idea of what the ramifications of not de-duplicating are? From my limited knowledge, there are two ramifications
What are the ultimate consequences of these? There's always a balance between performance and creating the pit of success for users, and it's hard for me to give much of an opinion without knowing how accidentally violating the duplicate-key rules actually plays out in this case. Fwiw, I don't have a problem with this being opt-in, as in "I am aware of how I build my attributes and explicitly know that providing duplicate attributes may cause bad data". I'm a big proponent of the pit of success, and a feature flag to enable better performance without code changes doesn't seem too onerous.
It'd depend on the backend! Prometheus handles de-dup itself by producing ";"-separated values, as shown in #1379 (comment). Metrics solutions inside my company (Microsoft) do overriding at ingestion. There could be other backends which do not handle this well. Hoping to hear about them via #1300
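A rough illustration of the ";"-joining behavior described above (a hypothetical `join_duplicate_labels` helper, not the actual exporter code): duplicate keys collapse into one label whose values are joined with ";" rather than one value silently dropping the other.

```rust
use std::collections::BTreeMap;

/// Hypothetical sketch of a Prometheus-exporter-style merge: duplicate
/// label keys are kept, with their values joined by ";".
fn join_duplicate_labels(labels: &[(&str, &str)]) -> Vec<(String, String)> {
    let mut merged: BTreeMap<&str, String> = BTreeMap::new();
    for &(k, v) in labels {
        merged
            .entry(k)
            .and_modify(|existing| {
                existing.push(';');
                existing.push_str(v);
            })
            .or_insert_with(|| v.to_string());
    }
    // BTreeMap iteration yields keys in sorted order.
    merged
        .into_iter()
        .map(|(k, v)| (k.to_string(), v))
        .collect()
}
```

Other backends may instead overwrite at ingestion (last write wins) or reject the series outright, which is why the impact is backend-specific.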
I guess my primary concern is with … If we don't dedup, then these are stored in … I assume the backends will automatically aggregate them into a single time series in that case?
Again, this is backend specific.
Closing for now, as we are exploring other options, and will come back to this after #1421 |
Fixes #1098. This adds to the gains achieved in #1379!
Similar to Tracing and Logs, this PR removes attribute de-duplication from the SDK. The changelog directs users to #1300; feedback from it can be used to evaluate whether we need to bring de-dup back as an opt-in feature in the OTel SDK.
Changes
Remove de-duplication of metric attributes from the SDK.
Add test to validate the behavior and assert that it is a conscious choice.
Prometheus Exporter already handles this (for duplicates that arise during sanitization), but order is not guaranteed. Tests are modified to account for this.
Performance gains
20% gain in the Counter.Add() hot path when 4 attributes are used; more attributes will show larger gains.
Numbers from my machine:
main branch
This PR
main branch
50% reduction in the time needed to make an AttributeSet from a slice!
This PR:
Stress Tests:
main branch
this PR
Stress tests use only 3 attributes, so the improvement appears smaller compared to the benchmarks, which use 4 attributes!
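For readers who want to reproduce the comparison locally without the project's benchmark harness, a std-only micro-benchmark sketch (all names are hypothetical; this is not the PR's benchmark code):

```rust
use std::collections::HashSet;
use std::time::{Duration, Instant};

// Build an attribute vector as-is (the fast path after this PR).
fn build_plain(attrs: &[(&str, &str)]) -> Vec<(String, String)> {
    attrs.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect()
}

// Build with a last-wins de-dup pass (the path this PR removes).
fn build_dedup(attrs: &[(&str, &str)]) -> Vec<(String, String)> {
    let mut seen = HashSet::new();
    let mut out = Vec::with_capacity(attrs.len());
    for &(k, v) in attrs.iter().rev() {
        if seen.insert(k) {
            out.push((k.to_string(), v.to_string()));
        }
    }
    out.reverse();
    out
}

// Crude timer: run `f` `iterations` times and return the elapsed time.
fn time_it<F: FnMut()>(iterations: u32, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iterations {
        f();
    }
    start.elapsed()
}
```

Timing both builders over, say, 100_000 iterations with 4 attributes gives a rough feel for the overhead; for trustworthy numbers, use a proper harness with warm-up and statistical analysis.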
Merge requirement checklist
CHANGELOG.md files updated for non-trivial, user-facing changes